feat: alert lifecycle management, configurable policies, and query performance indexes#135
Conversation
SQLite func.date() returns strings, not date objects. Use strftime and str() consistently to avoid AttributeError on .isoformat(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rmance - Alert lifecycle: full state machine (firing → acknowledged → resolved/suppressed) with history tracking, bulk operations, and SSE streaming via trace_routes - Alert policies: configurable per-agent thresholds for alert_type, severity, and threshold_value; full CRUD via /api/alert-policies with tenant scoping - Storage: AlertPolicyModel, AlertPolicyRepository, cache layer, 3 migrations (006 alert lifecycle columns, 007 alert policies table, 008 performance indexes) - Frontend: AlertPolicy/AlertLifecycle types, API client methods, useAlerts and useAlertSummary hooks - Tests: 365+ lines covering lifecycle state transitions, policy CRUD, and index/query performance assertions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the alert dashboard panel with lifecycle controls and policy management UI, wired into AnalyticsTab. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an end-to-end alert dashboard feature set: lifecycle state management, tenant-scoped alert policies, and DB/index/caching optimizations to improve query performance and UX responsiveness.
Changes:
- Introduces alert lifecycle fields + repository methods + new alert summary/trending/filtering endpoints.
- Adds alert policy persistence (model/repo/migrations) and CRUD API routes.
- Adds a frontend alert dashboard panel with hooks/types and new API client methods; includes new performance index migrations and a simple query cache.
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_query_performance.py | Adds tests for cache behavior, repository caching usage, and (placeholder) index verification. |
| tests/test_alert_policies.py | Adds repository-level tests for policy CRUD and tenant isolation. |
| tests/test_alert_lifecycle.py | Adds API tests for lifecycle status changes, bulk updates, and alert summary/trending/filtering. |
| storage/search.py | Documents intended index usage for session/event search paths. |
| storage/repository.py | Fixes daily cost breakdown date keying to match DB-returned date strings. |
| storage/repositories/policy_repo.py | Implements tenant-scoped AlertPolicyRepository CRUD + active-policy lookup logic. |
| storage/repositories/alert_repo.py | Adds caching, lifecycle updates, filtered listing, and summary/trending aggregation methods. |
| storage/repositories/init.py | Exports AlertPolicyRepository from the repositories package. |
| storage/models.py | Adds alert lifecycle columns to AnomalyAlertModel and introduces AlertPolicyModel. |
| storage/migrations/versions/006_add_alert_lifecycle.py | Adds lifecycle columns + tenant/status composite index to anomaly_alerts. |
| storage/migrations/versions/007_add_alert_policies.py | Creates alert_policies table and composite indexes. |
| storage/migrations/versions/008_add_alert_indexes.py | Adds additional indexes on hot query paths (alerts/sessions/events/patterns). |
| storage/cache.py | Introduces a simple in-memory TTL query cache. |
| storage/init.py | Re-exports AlertPolicyRepository and AlertPolicyModel from storage. |
| frontend/src/types/index.ts | Adds alert lifecycle/policy types used by UI and API client. |
| frontend/src/hooks/useAlerts.ts | Adds hook for listing alerts with filters + mutation helpers. |
| frontend/src/hooks/useAlertSummary.ts | Adds hook for summary + trending data loading. |
| frontend/src/components/AnalyticsTab.tsx | Mounts the new AlertDashboardPanel inside analytics view. |
| frontend/src/components/AlertDashboardPanel.tsx | Implements alert dashboard UI (summary, filters, list, lifecycle actions, trending bars). |
| frontend/src/api/client.ts | Adds alert/policy API client calls for the new endpoints. |
| frontend/src/App.css | Adds styling for the alert dashboard panel and controls. |
| docs/assets/gifs/screenshots/capture_search.py | Refactors selector and line wrapping; still uses a hard-coded local path. |
| collector/alerts/base.py | Adds optional policy getter integration for deriving thresholds. |
| api/trace_routes.py | Adds alert summary/trending, filtered listing, bulk status update, and per-alert status update endpoints. |
| api/schemas.py | Adds lifecycle schemas and alert policy schemas for new API routes. |
| api/policy_routes.py | Adds CRUD endpoints for alert policies. |
| api/main.py | Registers the new policy router. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def invalidate(self, key: str) -> None: | ||
| """Remove a specific key from the cache. | ||
|
|
||
| Args: | ||
| key: Cache key to invalidate | ||
| """ | ||
| with self._lock: | ||
| self._cache.pop(key, None) | ||
|
|
There was a problem hiding this comment.
QueryCache.invalidate() only removes an exact key, but the repository uses it like a prefix invalidation (e.g., invalidate('alert_summary:tenant:')). This means summary/trending caches will not be invalidated and can serve stale data. Add a prefix-capable invalidation method (e.g., invalidate_prefix(prefix: str)) or update invalidate() to support prefix removal explicitly, then update call sites accordingly.
| def invalidate(self, key: str) -> None: | |
| """Remove a specific key from the cache. | |
| Args: | |
| key: Cache key to invalidate | |
| """ | |
| with self._lock: | |
| self._cache.pop(key, None) | |
| def invalidate(self, key: str, *, prefix: bool = False) -> int: | |
| """Remove cache entries by exact key or by key prefix. | |
| Args: | |
| key: Exact cache key to invalidate, or prefix when ``prefix=True`` | |
| prefix: When True, remove all entries whose keys start with ``key`` | |
| Returns: | |
| Number of entries removed | |
| """ | |
| with self._lock: | |
| if not prefix: | |
| return 1 if self._cache.pop(key, None) is not None else 0 | |
| keys_to_remove = [cache_key for cache_key in self._cache if cache_key.startswith(key)] | |
| for cache_key in keys_to_remove: | |
| del self._cache[cache_key] | |
| return len(keys_to_remove) | |
| def invalidate_prefix(self, prefix: str) -> int: | |
| """Remove all cache entries whose keys start with the given prefix. | |
| Args: | |
| prefix: Cache key prefix to invalidate | |
| Returns: | |
| Number of entries removed | |
| """ | |
| return self.invalidate(prefix, prefix=True) |
| Returns: | ||
| List of dicts with alert_type, count, and avg_severity | ||
| """ | ||
| cache_key = f"trending_alerts:{self.tenant_id}:{hours}h" |
There was a problem hiding this comment.
get_trending_alerts() caches by {tenant_id, hours} but ignores limit. Calling it with different limits within the TTL can return an incorrectly sized cached result. Include limit in the cache key (or avoid caching when non-default limit is requested).
| cache_key = f"trending_alerts:{self.tenant_id}:{hours}h" | |
| cache_key = f"trending_alerts:{self.tenant_id}:{hours}h:{limit}" |
| """Invalidate summary and trending cache entries for this tenant.""" | ||
| # Invalidate all cache entries for this tenant | ||
| self._cache.invalidate(f"alert_summary:{self.tenant_id}:") | ||
| self._cache.invalidate(f"trending_alerts:{self.tenant_id}:") | ||
|
|
There was a problem hiding this comment.
Cache invalidation is incomplete for the endpoints introduced in this PR: get_alert_lifecycle_summary() caches under lifecycle_summary:{tenant_id} and get_alert_trending() caches under trending:{tenant_id}:{days}d, but _invalidate_summary_cache() does not invalidate those keys. This will lead to stale /api/alerts/summary and /api/alerts/trending responses after status changes or new alerts. Expand invalidation to cover lifecycle/trending keys as well (and ensure prefix invalidation actually works—see QueryCache.invalidate).
| """Invalidate summary and trending cache entries for this tenant.""" | |
| # Invalidate all cache entries for this tenant | |
| self._cache.invalidate(f"alert_summary:{self.tenant_id}:") | |
| self._cache.invalidate(f"trending_alerts:{self.tenant_id}:") | |
| """Invalidate cached aggregate alert data after alert mutations. | |
| This repository caches multiple tenant-scoped aggregate views, including | |
| summary, lifecycle summary, and trending data. Some cache keys are | |
| parameterized (for example, by time window), so invalidation must cover | |
| the full tenant namespace for each cache family. | |
| QueryCache.invalidate() may only invalidate exact keys, so after issuing | |
| targeted invalidation calls we also replace the shared cache instance as | |
| a defensive fallback to guarantee stale aggregate entries are dropped. | |
| """ | |
| cache_prefixes = ( | |
| f"alert_summary:{self.tenant_id}:", | |
| f"trending_alerts:{self.tenant_id}:", | |
| f"trending:{self.tenant_id}:", | |
| ) | |
| exact_keys = ( | |
| f"lifecycle_summary:{self.tenant_id}", | |
| ) | |
| for cache_key in cache_prefixes: | |
| self._cache.invalidate(cache_key) | |
| for cache_key in exact_keys: | |
| self._cache.invalidate(cache_key) | |
| # Defensive fallback: if invalidate() does not support prefix-based | |
| # eviction, replace the shared aggregate cache to avoid stale reads. | |
| type(self)._cache = QueryCache() |
| const response = await fetch(`${API_BASE}/alerts/${alertId}/status`, { | ||
| method: 'PUT', | ||
| headers: { 'Content-Type': 'application/json' }, | ||
| body: JSON.stringify({ status, resolution_note: note }), |
There was a problem hiding this comment.
The backend AlertStatusUpdate schema expects { status, note }, but the client sends { status, resolution_note }. This causes resolution notes to be silently ignored. Change the request payload to send note (and keep resolution_note only as a response field).
| body: JSON.stringify({ status, resolution_note: note }), | |
| body: JSON.stringify({ status, note }), |
| } | ||
|
|
||
| export async function fetchAlertTrending(days: number = 7) { | ||
| return fetchJSON<AlertTrendingPoint[]>(`${API_BASE}/alerts/trending?days=${days}`) |
There was a problem hiding this comment.
/api/alerts/trending returns an object { trending: [...], days } (AlertTrendingSchema), not a raw AlertTrendingPoint[]. This client will mis-hydrate data and break the UI at runtime. Update fetchAlertTrending() to return the correct schema (or unwrap trending before returning).
| return fetchJSON<AlertTrendingPoint[]>(`${API_BASE}/alerts/trending?days=${days}`) | |
| const data = await fetchJSON<{ trending: AlertTrendingPoint[]; days: number }>( | |
| `${API_BASE}/alerts/trending?days=${days}` | |
| ) | |
| return data.trending |
| screenshots_dir = Path( # noqa: E501 | ||
| "/home/nistrator/Documents/github/amplifier/ai_working/agent_debugger/docs/assets/gifs/screenshots" | ||
| ) |
There was a problem hiding this comment.
This hard-coded absolute path makes the script non-portable for other developers/CI. Prefer deriving the directory relative to the script location (e.g., using Path(__file__).resolve().parent) or accepting it as an argument/env var.
| screenshots_dir = Path( # noqa: E501 | |
| "/home/nistrator/Documents/github/amplifier/ai_working/agent_debugger/docs/assets/gifs/screenshots" | |
| ) | |
| screenshots_dir = Path(__file__).resolve().parent | |
| screenshots_dir.mkdir(parents=True, exist_ok=True) |
| className={`alert-row ${getStatusVariant(alert.status)}`} | ||
| onClick={() => setExpandedAlertId(expandedAlertId === alert.id ? null : alert.id)} |
There was a problem hiding this comment.
Clickable div elements without keyboard support are not accessible (no tab stop / Enter/Space activation / role). Use a semantic button for the row header interaction, or add role=\"button\", tabIndex={0}, and key handlers mirroring click behavior.
| className={`alert-row ${getStatusVariant(alert.status)}`} | |
| onClick={() => setExpandedAlertId(expandedAlertId === alert.id ? null : alert.id)} | |
| className={`alert-row ${getStatusVariant(alert.status)}`} | |
| role="button" | |
| tabIndex={0} | |
| aria-expanded={expandedAlertId === alert.id} | |
| onClick={() => setExpandedAlertId(expandedAlertId === alert.id ? null : alert.id)} | |
| onKeyDown={(e) => { | |
| if (e.key === 'Enter' || e.key === ' ') { | |
| e.preventDefault() | |
| setExpandedAlertId(expandedAlertId === alert.id ? null : alert.id) | |
| } | |
| }} |
| """ | ||
| self.policy_getter = policy_getter | ||
|
|
||
| def get_threshold(self, alert_type: str, agent_name: str | None = None, default_threshold: float = 0.0) -> float: |
There was a problem hiding this comment.
get_threshold() is synchronous but calls an injected policy_getter that will commonly need to be async (DB-backed) in this codebase. If an async callable is passed, policy becomes a coroutine and .get(...) will fail at runtime. Either require a synchronous policy getter explicitly (and type it accordingly), or make get_threshold async and support awaiting the getter (including Awaitable return types).
| if self.policy_getter: | ||
| policy = self.policy_getter(alert_type, agent_name) | ||
| if policy and policy.get("enabled", True): | ||
| return policy.get("threshold_value", default_threshold) |
There was a problem hiding this comment.
get_threshold() is synchronous but calls an injected policy_getter that will commonly need to be async (DB-backed) in this codebase. If an async callable is passed, policy becomes a coroutine and .get(...) will fail at runtime. Either require a synchronous policy getter explicitly (and type it accordingly), or make get_threshold async and support awaiting the getter (including Awaitable return types).
| # Wait for expiration | ||
| time.sleep(1.1) | ||
| assert cache.get("expiring_key") is None |
There was a problem hiding this comment.
These tests introduce real-time sleeps which will slow the suite and can be flaky on loaded CI runners. Prefer controlling time via a clock abstraction in QueryCache, monkeypatching time.time(), or using a time-freezing utility so TTL behavior is deterministic without sleeping.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d4489d8455
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self._cache.invalidate(f"alert_summary:{self.tenant_id}:") | ||
| self._cache.invalidate(f"trending_alerts:{self.tenant_id}:") |
There was a problem hiding this comment.
Invalidate actual alert summary cache keys
QueryCache.invalidate removes only exact keys, but _invalidate_summary_cache passes tenant prefixes (alert_summary:{tenant}: and trending_alerts:{tenant}:) and never touches the new keys used by get_alert_lifecycle_summary/get_alert_trending (lifecycle_summary:... and trending:...). After create/status updates, /api/alerts/summary and /api/alerts/trending can keep serving stale data for the full TTL window instead of reflecting the mutation immediately.
Useful? React with 👍 / 👎.
| const response = await fetch(`${API_BASE}/alerts/${alertId}/status`, { | ||
| method: 'PUT', | ||
| headers: { 'Content-Type': 'application/json' }, | ||
| body: JSON.stringify({ status, resolution_note: note }), |
There was a problem hiding this comment.
Send status update note under expected field name
The status update API expects note (see AlertStatusUpdate), but this request body sends resolution_note; FastAPI will ignore the unknown field, so notes entered in the dashboard are silently dropped on acknowledge/resolve/dismiss transitions. This breaks the lifecycle note/audit behavior even though the status change itself succeeds.
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| export async function fetchAlertTrending(days: number = 7) { | ||
| return fetchJSON<AlertTrendingPoint[]>(`${API_BASE}/alerts/trending?days=${days}`) |
There was a problem hiding this comment.
Deserialize trending response envelope before storing
The backend returns /api/alerts/trending as an object with a trending array, but this client call types and parses it as a bare AlertTrendingPoint[]. The hook then stores an object in array state, so checks like trending.length > 0 fail and the trending chart never renders even when data exists.
Useful? React with 👍 / 👎.
| <option value="critical">Critical</option> | ||
| <option value="high">High</option> | ||
| <option value="medium">Medium</option> | ||
| <option value="low">Low</option> |
There was a problem hiding this comment.
Use numeric severity values for alert filter options
The alerts endpoint defines severity as a float query parameter (0.0–1.0), but these options submit string labels (critical/high/medium/low). Selecting one sends an invalid query value, which yields a 422 response and empties the alert list for that filter choice.
Useful? React with 👍 / 👎.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f5e249c0b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| NotFoundError: if alert not found | ||
| ValueError: if status is invalid | ||
| """ | ||
| alert = await repo._alert_repo.update_alert_status(alert_id, update.status, update.note) |
There was a problem hiding this comment.
Map invalid alert statuses to client errors
This route forwards update.status directly to AnomalyAlertRepository.update_alert_status, which raises ValueError for unknown statuses. Because the exception is not handled here, invalid input like "closed" is surfaced as a 500 from the global exception handler instead of a 4xx validation error, which makes client mistakes look like server outages. Constrain status to an enum/Literal (or catch and translate ValueError) so bad statuses return 422/400.
Useful? React with 👍 / 👎.
|
|
||
| export async function fetchAlertPolicies(agentName?: string) { | ||
| const params = agentName ? `?agent_name=${encodeURIComponent(agentName)}` : '' | ||
| return fetchJSON<AlertPolicy[]>(`${API_BASE}/alert-policies${params}`) |
There was a problem hiding this comment.
Deserialize alert policy list response envelope
/api/alert-policies returns an object (AlertPolicyListResponse) with policies and total, but this client function casts the whole payload to AlertPolicy[]. Any consumer that treats the return value as an array will fail at runtime (e.g., .map/.length on the wrong shape) and never display policies.
Useful? React with 👍 / 👎.
| > | ||
| <option value="">All Types</option> | ||
| <option value="oscillation">Oscillation</option> | ||
| <option value="looping">Looping</option> |
There was a problem hiding this comment.
Align looping filter value with stored alert_type
The Looping filter submits alert_type=looping, but persisted loop alerts use alert_type="tool_loop" (see collector/alerts/tool_loop.py). Selecting this option therefore filters out real loop alerts and shows an empty result set even when matching alerts exist.
Useful? React with 👍 / 👎.
- Add prefix-based cache invalidation to QueryCache
- Fix cache key to include limit parameter in trending queries
- Expand cache invalidation to cover all aggregate cache families
- Fix frontend to send 'note' instead of 'resolution_note' in status updates
- Fix trending API response unwrapping (unwrap .trending from envelope)
- Fix bulk status return type to match backend ({ updated, status })
- Change ManagedAlert.severity from RiskLevel to number, add severityLabel helper
- Use numeric severity values in filter dropdown (matching backend float param)
- Add accessibility (role, tabIndex, onKeyDown) to alert rows
- Make AlertDeriver.get_threshold async-safe with coroutine detection
- Use UNSET sentinel in policy_repo.update_policy for nullable field support
- Rename ix_sessions_created_at to ix_sessions_started_at in migration 008
- Replace real sleeps with monkeypatched time in cache tests
- Fix hard-coded path in capture_search.py (use Path(__file__).parent)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed7b4ea3b8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| policy = await repo.update_policy( | ||
| policy_id=policy_id, | ||
| agent_name=data.agent_name, | ||
| alert_type=data.alert_type, | ||
| threshold_value=data.threshold_value, | ||
| severity_threshold=data.severity_threshold, | ||
| enabled=data.enabled, |
There was a problem hiding this comment.
Preserve unspecified policy fields on update
AlertPolicyUpdate makes every field optional, but this handler always forwards data.agent_name, data.alert_type, data.threshold_value, etc. to repo.update_policy. For omitted fields, Pydantic supplies None, which bypasses the repository’s _UNSET sentinel and overwrites existing values (including setting non-null columns like alert_type to None, causing a commit-time integrity error/500 on partial updates). This breaks the partial-update contract used by updateAlertPolicy(..., Partial<AlertPolicy>) and can unintentionally clear policy data.
Useful? React with 👍 / 👎.
Summary
/api/alert-policiesCRUD endpoints, scoped per tenantAlertPolicyModel,AlertPolicyRepository, query cache, and 3 new migrations (006 lifecycle columns, 007 policies table, 008 performance indexes)AlertPolicy/alert lifecycle types intypes/index.ts, API client methods, anduseAlerts/useAlertSummaryReact hooksMigrations
006_add_alert_lifecycle007_add_alert_policies008_add_alert_indexesTest plan
tests/test_alert_lifecycle.py— state transitions, bulk ops, history tracking (365 lines)tests/test_alert_policies.py— policy CRUD, tenant isolation, filtering (268 lines)tests/test_query_performance.py— index presence, query plan assertions (250 lines)ruff check .— all checks passed🤖 Generated with Claude Code